[CK DSL] CK DSL provider proof of concept for hipDNN#7916
Draft
DarylHawkinsAMD wants to merge 19 commits into
Draft
[CK DSL] CK DSL provider proof of concept for hipDNN#7916DarylHawkinsAMD wants to merge 19 commits into
DarylHawkinsAMD wants to merge 19 commits into
Conversation
Plan v0.8 for a new hipDNN engine plugin that exposes CK DSL-produced kernels via runtime JIT compilation. Captures decided architecture (embedded pybind11 interpreter, C++ adapter layer, in-memory JIT cache), the M1 milestone (implicit-GEMM conv with hipEvent timing and CpuFpReferenceConvolution verification), the implementation stream decomposition, and the resolved decision log so the next session can pick up from prep step P-1 without re-litigating context. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Creates dnn-providers/ck-dsl-provider/ as a buildable-but-empty hipDNN engine plugin: CkDslConvImplicitGemmEngine reports no applicable graphs and the .so links cleanly through the SDK's EnginePluginImpl.inl macros. Wires the provider into the rocm-libraries superbuild as a new component and preset. Also updates plan v0.8 -> v0.9 to rename the single M1 engine after its op so the eventual per-spec engine fan-out (M5) is additive rather than a refactor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Links the plugin .so to libpython3 + pybind11 and initializes a per-process scoped_interpreter from CkDslContainer's ctor (the SharedContainerManager's per-process hook). The interpreter is heap- allocated and intentionally never finalized so sibling plugins that share CPython are unaffected at plugin unload. CMake pins discovery to the system Python (Python3_ROOT_DIR=/usr, FIND_STRATEGY=LOCATION) to avoid uv-managed Pythons in ~/.local that fail at runtime with a codec-of-the-filesystem-encoding error. Adds three gtest cases (InitializesOnce, CanImportStdlib, SurvivesGilReentry) under a new ck_dsl_provider_unit_tests target. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CompileServiceBridge owns the imported ck_dsl_provider.compile_service module and is constructed by CkDslContainer right after the embedded interpreter is up. sys.path is prepended (idempotently) with two CMake-baked absolute paths: the provider-local python/ directory and projects/composablekernel/python/, so both ck_dsl_provider and ck_dsl import cleanly without any packaging metadata or PYTHONPATH env. The generated ckdsl_provider_paths.h carries those literals. PythonError translates pybind11::error_already_set into a HipdnnPluginException, preserving the Python exception type and message. Bridge methods destroy the error object inside the GIL scope so the Python objects it owns release cleanly. For I-3 the bridge only exposes a noop_smoke() roundtrip that imports ck_dsl and returns its __file__; the real compile() entry point is the I-7 milestone. Three new gtest cases cover the happy path, error translation, and sys.path idempotence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CompileServiceBridge holds a py::module_ member. The defaulted dtor let that member decref its PyObject* on whatever thread tore the container down — at process exit that thread does not hold the GIL, which is undefined behaviour and asserts in CPython debug builds. Custom dtor acquires the GIL and clears the module before letting the member's default destructor run on an empty object. Also documents the sys.path-idempotence and bridge-thread-safety assumptions inline (reviewer minor notes). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Carries forward across a push + re-clone on another node:
- STATUS.md resume-point handoff: HEAD, plan progress, next
step (I-4), gfx950 hardware constraint, gotchas.
- prep_findings/ synthesis of the §6.1 P-1..P-7 prep results that
everything from I-1 onward references.
- pybind11_rtld_local_spike/ reproducible source of the P-3 spike
that proved pybind11 works inside a hipDNN plugin
.so loaded RTLD_LOCAL.
WIP/.gitignore drops *.log and */build/ so subsequent runs do not
re-dirty the tree.
All of this can be deleted post-M1 once the durable docs live in
their permanent homes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the embedded compile service to a real HSACO load + launch on gfx950: compile_service.compile_smoke() builds an FP16 elementwise-copy kernel via ck_dsl, the C++ CompileServiceBridge translates the Python dict into a KernelArtifact, and HipModule + LaunchAbi load and launch it through hipModuleLoadData / hipModuleLaunchKernel. The new runtime/ tree (KernelArtifact, LaunchAbi, HipModule) follows the P-1 design memo: schema-driven arg packing replaces launcher.cpp's hardcoded per-kind layout, an explicit ldsBytes field closes the launcher's dynamic-LDS gap, and HipModule's RAII wrapper unloads the module on dtor and on a failed hipModuleGetFunction. HipModuleSmokeTest gates on hipGetDeviceCount() > 0 so the host-only CI lane stays green; on a gfx950 node both cases pass and the elementwise copy correctly writes FP16 1.0 (0x3C00) to its output.
SignatureHash -> shared_ptr<HipModule> map, mutex-guarded, with a getOrLoad(key, loader) that invokes the loader at most once per key for the cache's lifetime. M1 has no eviction policy and no negative cache; both are M2+ work per plan §3.4. The loader runs under the cache mutex so concurrent misses on the same key wait for the first compile rather than racing it. For M1 the provider is effectively single-threaded per handle so this lock is uncontended; a per-key shared_future scheme is the natural M2+ upgrade once concurrent compiles on distinct keys become a real workload. JitCacheSmoke validates the miss-then-hit shape end-to-end against a real compile via compileSmoke: the loader runs exactly once across two calls, both calls return the same HipModule, and the hit comes back in <10 ms while the miss takes hundreds of ms. Two host-only tests cover the empty-loader guard and the empty-cache miss case. Full suite: 12/12 green.
Pure-C++ ConvImplicitGemmSpec mirrors the ck_dsl ImplicitGemmConvSpec dataclass (13 ConvProblem fields + 22 spec fields). Constexpr defaults match the bake-off values from PREP_FINDINGS P-5, not the dataclass defaults: tile_k=64 (not 128), warp_tile_m/n=32 (not 16), warp_tile_k=16 (not 32), epilogue="cshuffle" (not "default"). lds_layout is omitted; the Python dataclass re-derives it from async_dma / lds_k_pad / tile_k. ConvImplicitGemmAdapter::buildSpec walks a single ConvolutionFwdAttributes + its tensor map and populates the ~15 graph-derived fields. Validates 2-D-only, FP16-only, symmetric padding, CROSS_CORRELATION, 4-D dims, matching X/W channel dim, and Y N/K consistency. All scalars narrowed from int64 to int32 with a range check (the DSL signature is i32). ConvImplicitGemmPayload::convImplicitGemmSpecToPayload translates the spec into the py::dict that I-7's compile_service.compile will consume via **payload splat. The dict is field-for-field with the dataclass so a divergence (extra/missing field) fails loudly at the dataclass ctor. Six host-only adapter unit tests cover the bake-off shape happy path plus the rejection paths (asymmetric padding, 3-D conv, true convolution mode, missing tensor, non-HALF dtype). Two payload tests exercise the GIL-aware translation and the round-trip-through-Python divergence canary that splats the C++ payload into the real ImplicitGemmConvSpec dataclass and reads back its derived props. Full suite: 20/20 green.
Wires the end-to-end miss/hit pipeline: hipDNN graph -> applicability check -> adapter -> signature -> JitCache -> bridge.compile -> Python compile_service -> HSACO -> HipModule. Execute() is still a stub (I-8 wires the launch). New surfaces: * `compile_service.compile(op_kind, payload)` dispatches by op_kind; for "conv_implicit_gemm" it splats the C++ payload through `ImplicitGemmConvSpec(problem=ConvProblem(**...), **rest)`, builds the kernel, returns the same dict shape as compile_smoke with the 6-arg conv kernel ABI (3 ptrs + 3 i32 bytes) and the canonical grid `(num_pid_n, num_pid_m, 1)` / block `(block_size, 1, 1)`. * `CompileServiceBridge::compile(opKind, payload)` factors the dict->KernelArtifact translation out of compileSmoke and exposes the production entry point. * `GraphSignature::computeForConvFwd` is an FNV-1a 64-bit hash over op_kind + dtypes + tensor dims + conv attrs + the provider/DSL git SHA (via CK_DSL_PROVIDER_VERSION_STRING). DSL changes invalidate every prior key by construction. * `HipModule` now retains the launch metadata (grid, block, ldsBytes, argSchema, kind) from the artifact at ctor time so the plan layer doesn't have to thread the artifact alongside the module. * `ConvImplicitGemmPlanBuilder` owns the per-op JitCache. isApplicable runs the adapter as the validator; buildPlan derives the signature, hits the cache, on miss compiles via the bridge with the payload from `convImplicitGemmSpecToPayload`, and stores a `ConvImplicitGemmPlan` (stub execute() until I-8) on the context. * `CkDslConvImplicitGemmEngine` now delegates all five IEngine virtuals to the plan builder. CkDslContainer replaces the static factory with a member createEngine that captures the bridge. Tests: 13 new cases (33 total, all green). 4 host-only plan-builder cases (isApplicable true/false, workspace, knobs), 2 GPU-gated (BuildPlanCachesOnSecondCall verifies the second buildPlan reuses the HipModule and completes in <50ms vs ~6.5s for the first compile; the launch metadata cross-checks plan §4: grid=(1,392,1), block=(256,1,1), 6-arg schema; PlanExecuteIsStub asserts the I-8 sentinel), 7 GraphSignature host-only (determinism + per-axis sensitivity to op_kind, shape, stride, padding, dtype, plus missing-tensor reject).
Replaces the I-7 stub with the real launch path: * findDeviceBuffer scans the deviceBuffers array for X/W/Y by uid (matching miopen-provider's pattern). Missing uid throws with the searched uid and array size in the message. * LaunchAbi::pack builds the 6-arg kernel buffer (3 device pointers + 3 i32 buffer-rsrc byte bounds) against the artifact's argSchema carried on the HipModule. * HipModule::launch fires hipModuleLaunchKernel via the handle's stream, using grid/block/ldsBytes captured from the artifact at load time. PlanBuilder computes xBytes/wBytes/yBytes from the spec's ConvProblem geometry at build time (N*H*W*C*2, K*R*S*C*2, N*Ho*Wo*K*2 for FP16) and embeds them in the plan -- the graph isn't in scope at execute() and the sizes are static per signature. Guards against the i32 overflow case (the kernel signature is i32 for A_bytes/B_bytes/D_bytes). Tests: PlanExecuteIsStub -> PlanExecuteLaunches (GPU-gated). Allocates real 3.2MB X/Y + 73KB W FP16 buffers, sets W = 0xab as sentinel for unwritten output, runs execute(), syncs, verifies zero input + zero weight -> zero output (i.e. the kernel actually ran). Plus a new ExecuteRejectsMissingDeviceBuffer host-only case (with a GPU skip because the buildPlan step still needs to compile + load the module into a real HIP module). Full suite: 34/34 green.
Self-contained warmup-and-iterate timing utility per PREP_FINDINGS P-7. Defaults: 5 warmup (matches launcher.cpp:559) + 50 timed (half of launcher.cpp's 100, keeps the integration test under ~1s of kernel time while keeping the median stable). Protocol: * Per-iter event pairs (one start + one stop per timed iter). All records issued on the supplied stream so HIP serialises them. * Single hipEventSynchronize on the final stop event drains all prior pairs at once -- no per-iter hipDeviceSynchronize. * Min + median microseconds reported. Median uses partial nth_element (one pass for odd N, two for even N) instead of a full sort. * TFLOPS = flops / median_seconds / 1e12 when flops > 0; zero otherwise so the smoke kernel (no defined arithmetic intensity) doesn't produce a spurious value. * Logging-only -- no perf-target assertions, matching the plan Q9 resolution that M1 logs absolute TFLOPS and defers comparison baselines to M2. API: PerfMeasurement::measure(launchFn, flops, stream) takes any callable (template, dispatched through std::function so the HIP machinery stays in the .cpp). PerfResult carries warmup/timed iter counts so the log line is self-describing. log() emits [CkDslPerf] tag=... warmup=N iters=M min_us=X.X median_us=Y.Y tflops=Z.Z through HIPDNN_PLUGIN_LOG_INFO. Tests (8 new, 42 total green): 6 host-only stats cases (odd-N median, even-N median, single sample, empty rejection, zero-timed- iters rejection, P-7 defaults) plus 2 GPU-gated end-to-end cases that drive the helper against the compileSmoke kernel and verify flops==0 -> tflops==0 and the flops formula round-trip.
End-to-end JIT-path integration test for the M1 capstone shape (N=8, 56x56x64 -> 64, 3x3, stride 1, pad 1, FP16, NHWC): 1. Builds a single-op conv-fwd FB graph via the test SDK helper. 2. Allocates Tensor<half> for X/W/Y with logical NCHW dims + NHWC physical strides via TensorLayout::NHWC (and KCRS strides for W). 3. Seeds X/W with deterministic random values in [-0.1, 0.1] so the K_gemm=576 accumulator stays in the numerically-friendly part of FP16. 4. Drives the JIT path through ConvImplicitGemmPlanBuilder (engine -> adapter -> signature -> bridge -> compile_service -> HSACO -> HipModule), executes the resulting plan via the same UID-keyed device-buffer interface the SDK uses at runtime. 5. Runs CpuFpReferenceConvolution::fprop on the same host-side tensors and asserts element-wise allclose at 5e-2 abs tolerance over the full ~1.6M-element NHWC output. The tolerance is loose relative to the analytical bound (~2.4e-4 for K=576 fp16 accumulations on uniform [-0.1, 0.1] inputs) so the test isn't brittle to minor codegen reshufflings. 6. PerfMeasurement(default 5/50) over plan.execute(), logs via HIPDNN_PLUGIN_LOG_INFO in the [CkDslPerf] format from P-7. Verified on this gfx950 (MI350-series): 131 TFLOPS median for the bake-off shape with the bake-off knobs (vs the example's 248 on MI300X; M2 autotuning will close the gap). **Adaptation from plan §1:** the test bypasses the hipDNN frontend API and the backend's .so-loading plugin path. Both surfaces are architecturally additive on top of what this test already proves -- the plan-builder + plan-execute path is the exact same code the backend would call after dlopen. Folding in the frontend Graph API + plugin loader is M1.5 / I-11 work. Full suite: 43/43 green, ~19s wall (15s for this test on a cold comgr).
Moves IntegrationGpuCkDslConvFp16 into a dedicated integration_tests/
subdir with its own gtest binary + main, and calls
finalize_test_targets("ck-dsl-provider") to register the prefixed
ctest targets the plan §6.2 I-11 advertises:
* ninja ck-dsl-provider-unit-check (label "unit_test", ~10 s)
* ninja ck-dsl-provider-integration-check (label "integration_test", ~9 s)
* ninja ck-dsl-provider-check (both lanes)
Plus the -verbose variants per the Tests.cmake pattern.
Tests split:
* tests/ 42 cases, all host-only or
GPU-gated, none requiring
the bake-off shape.
* integration_tests/ 1 case (BakeOffConv); the
end-to-end JIT-path
capstone from I-10.
pre-commit run over the full provider tree (every .cpp / .hpp / .py
/ CMakeLists.txt / .cmake / .json / .md / .in under
dnn-providers/ck-dsl-provider/) -- all hooks pass: trailing
whitespace, end-of-file, large-file, black, clang-format, cmake-lint.
WIP/STATUS.md updated to reflect M1 completion, including the
measured 131 TFLOPS on gfx950 (MI350-series) for the bake-off shape
and the M1.5 / M2 deferral list (frontend Graph API + .so loader
integration, autotuning, second op, on-disk cache).
Add an Architecture section with two Mermaid diagrams: a layered component view and an end-to-end request->compile->launch sequence. Both verified to render with mermaid-cli 11. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Convert IntegrationGpuCkDslConvFp16 from a single hardcoded BakeOffConv TEST_F into a parameterized TEST_P over a ConvCase shape list. The original bake-off shape is preserved as the BakeOff case. The bake-off shape is fully tile-aligned (M=N*Ho*Wo, GEMM-N=K, GEMM-K=C*R*S all multiples of the kernel's 64-wide tile), leaving partial-tile boundary handling unverified. The shape set now adds tile-aligned variants (stride 2, 1x1, C/K=128, non-square R!=S, dilation 2) and partial-tile probes (partial GEMM-N, GEMM-K, GEMM-M, and all three at once). All 10 shapes pass on gfx950 with worst abs diff 6e-5..2.4e-4 against the CPU reference (5e-2 tolerance). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Non-functional cleanups in ck-dsl-provider; no change to the JIT or launch behaviour. Graph signature: - Remove the legacy FlatBuffer-walking GraphSignature::computeForConvFwd. The cache key now derives solely from the adapter-built ConvImplicitGemmSpec via computeForSpec, keeping the adapter the single FlatBuffer reader and removing a parallel read path that could drift. - Rewrite GraphSignatureTest around computeForSpec: per-field and codegen-knob sensitivity, position-aliasing, optional-knob presence, and an adapter-built-spec round-trip. - Clarify the SignatureHash doc and spell out the on-disk-cache (M3) precondition: dtype/arch/layout must join the key before entries can be persisted. Naming: - Rename badParam -> throwBadParam in ConvImplicitGemmAdapter to reflect that it is [[noreturn]]. Dead code (zero consumers, confirmed against the provider, tests, and the plugin ABI): - EmbeddedInterpreter::importCheck -- I-3 bring-up scaffolding, unused since the integration test landed. - HipModule::kind() and its write-only _kind member. - The unused HipModule::launch(std::vector, ...) overload. - CkDslConvImplicitGemmEngine::planBuilderForTesting() -- a test seam with no test. Kept deliberately as forward-compat per dsl_docs/hipdnn_provider/plan.md: KernelArtifact::isa (M3 disk cache) and the I64/F32/F16 launch-ABI surface (M2/M5 ops). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a "Design plan" section to the ck-dsl-provider README pointing at the CK DSL hipDNN Provider plan as the source of truth for milestone scope, the runtime embedded-Python architecture, and resolved design questions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
a43ac83 to
353c954
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Technical Details
Test Plan
Test Result
Submission Checklist